5' ESTS AND ENCODED HUMAN PROTEINS



Background of the Invention
The estimated 50,000-100,000 genes scattered along the human chromosomes offer tremendous
promise for the understanding, diagnosis, and treatment of human diseases. In addition, probes capable
of specifically hybridizing to loci distributed throughout the human genome find applications in the 
construction of high resolution chromosome maps and in the identification of individuals. 

In the past, the characterization of even a single human gene was a painstaking process, 
requiring years of effort. Recent developments in the areas of cloning vectors, DNA sequencing, and 
computer technology have merged to greatly accelerate the rate at which human genes can be isolated,
sequenced, mapped, and characterized. 

Currently, two different approaches are being pursued for identifying and characterizing the 
genes distributed along the human genome. In one approach, large fragments of genomic DNA are 
isolated, cloned, and sequenced. Potential open reading frames in these genomic sequences are 
identified using bioinformatics software. However, this approach entails sequencing large stretches
of human DNA which do not encode proteins in order to find the protein encoding sequences
scattered throughout the genome. In addition to requiring extensive sequencing, this approach relies on
bioinformatics software which may mischaracterize the genomic sequences obtained, for example by
labeling non-coding DNA as coding DNA and vice versa.
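
The open reading frame search mentioned above can be illustrated with a minimal sketch. It is a toy, not the software referred to in the text: the function name find_orfs and the min_codons parameter are hypothetical, only the forward strand is scanned, and introns are ignored. It also shows why mischaracterization is possible, since any ATG-to-stop run is reported whether or not it is actually translated.

```python
# Toy open-reading-frame scan on the forward strand (illustrative only).
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=3):
    """Return sorted (start, end) pairs, 0-based half-open, for ATG...stop runs."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG seen in this frame, if any
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                # Count codons including the start and stop codons.
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)

if __name__ == "__main__":
    print(find_orfs("CCATGAAATTTTAGGG"))
```

A production gene finder would additionally model splice sites, codon usage, and the reverse strand.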

An alternative approach takes a more direct route to identifying and characterizing human
genes. In this approach, complementary DNAs (cDNAs) are synthesized from isolated messenger
RNAs (mRNAs) which encode human proteins. Using this approach, sequencing is only performed on 
DNA which is derived from protein coding portions of the genome. Often, only short stretches of the 
cDNAs are sequenced to obtain sequences called expressed sequence tags (ESTs). The ESTs may then 
be used to isolate or purify extended cDNAs which include sequences adjacent to the EST sequences.
The extended cDNAs may contain all of the sequence of the EST which was used to obtain them or only 
a portion of the sequence of the EST which was used to obtain them. In addition, the extended cDNAs 
may contain the full coding sequence of the gene from which the EST was derived or, alternatively, the 
extended cDNAs may include portions of the coding sequence of the gene from which the EST was 
derived. It will be appreciated that there may be several extended cDNAs which include the EST
sequence as a result of alternative splicing or the activity of alternative promoters. Alternatively, ESTs
having partially overlapping sequences may be identified and contigs comprising the consensus
sequences of the overlapping ESTs may be assembled.
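
As a rough illustration of how two partially overlapping ESTs can be joined into a contig, the sketch below merges two reads when a suffix of the first exactly matches a prefix of the second. The function name merge_overlapping and the min_overlap parameter are hypothetical, and exact matching is an assumption made for brevity; a real assembler would tolerate sequencing errors and derive a consensus from many reads.

```python
def merge_overlapping(a, b, min_overlap=4):
    """Merge b onto a when a suffix of a equals a prefix of b.

    Returns the merged sequence, or None if no exact overlap of at least
    min_overlap bases exists. The longest overlap is preferred.
    """
    for k in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a[-k:] == b[:k]:
            return a + b[k:]
    return None

if __name__ == "__main__":
    # The last four bases of the first read match the first four of the second.
    print(merge_overlapping("ACGTTGCA", "TGCAGGTC"))
```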

In the past, these short EST sequences were often obtained from oligo-dT primed cDNA 
libraries. Accordingly, they mainly corresponded to the 3' untranslated region of the mRNA. In part,
the prevalence of EST sequences derived from the 3' end of the mRNA is a result of the fact that typical
techniques for obtaining cDNAs are not well suited for isolating cDNA sequences derived from the 5'
ends of mRNAs (Adams et al., Nature 377:3-174, 1996; Hillier et al., Genome Res. 6:807-828, 1996).

In addition, in those reported instances where longer cDNA sequences have been obtained, the
reported sequences typically correspond to coding sequences and do not include the full 5' untranslated
region (5'UTR) of the mRNA from which the cDNA is derived. Indeed, 5'UTRs have been shown to
affect either the stability or translation of mRNAs. Thus, regulation of gene expression may be achieved
through the use of alternative 5'UTRs as shown, for instance, for the translation of the tissue inhibitor
of metalloprotease mRNA in mitogenically activated cells (Waterhouse et al., J Biol Chem. 265:5585-9,
1990). Furthermore, modification of the 5'UTR through mutation, insertion or translocation events
may even be implicated in pathogenesis. For instance, the fragile X syndrome, the most common cause
of inherited mental retardation, is partly due to an insertion of multiple CGG trinucleotides in the
5'UTR of the fragile X mRNA resulting in the inhibition of protein synthesis via ribosome stalling
(Feng et al., Science 268:731-4, 1995). An aberrant mutation in regions of the 5'UTR known to
inhibit translation of the proto-oncogene c-myc was shown to result in upregulation of c-myc protein
levels in cells derived from patients with multiple myeloma (Willis et al., Curr Top Microbiol
Immunol 224:269-76, 1997). In addition, the use of oligo-dT primed cDNA libraries does not allow the
isolation of complete 5'UTRs since the incomplete sequences obtained by this process may not include
the first exon of the mRNA, particularly in situations where the first exon is short. Furthermore, they
may not include some exons, often short ones, which are located upstream of splicing sites. Thus, there
is a need to obtain sequences derived from the 5' ends of mRNAs.

While many sequences derived from human chromosomes have practical applications, 
approaches based on the identification and characterization of those chromosomal sequences which 
encode a protein product are particularly relevant to diagnostic and therapeutic uses. In some instances, 
the sequences used in such therapeutic or diagnostic techniques may be sequences which encode proteins 
which are secreted from the cell in which they are synthesized. Those sequences encoding secreted
proteins, as well as the secreted proteins themselves, are particularly valuable as potential therapeutic
agents. Such proteins are often involved in cell to cell communication and may be responsible for 
producing a clinically relevant response in their target cells. In fact, several secretory proteins, including 
tissue plasminogen activator, G-CSF, GM-CSF, erythropoietin, human growth hormone, insulin, 
interferon-α, interferon-β, interferon-γ, and interleukin-2, are currently in clinical use. These proteins
are used to treat a wide range of conditions, including acute myocardial infarction, acute ischemic stroke, 
anemia, diabetes, growth hormone deficiency, hepatitis, kidney carcinoma, chemotherapy-induced 
neutropenia and multiple sclerosis. For these reasons, extended cDNAs encoding secreted proteins or 
portions thereof represent a valuable source of therapeutic agents. Thus, there is a need for the 
identification and characterization of secreted proteins and the nucleic acids encoding them.

In addition to being therapeutically useful themselves, secretory proteins include short peptides, 
called signal peptides, at their amino termini which direct their secretion. These signal peptides are 
encoded by the signal sequences located at the 5' ends of the coding sequences of genes encoding
secreted proteins. These signal peptides can be used to direct the extracellular secretion of any protein to 
which they are operably linked. In addition, portions of the signal peptides, called membrane-translocating
sequences, may also be used to direct the intracellular import of a peptide or protein of
interest. This may prove beneficial in gene therapy strategies in which it is desired to deliver a particular
gene product to cells other than the cells in which it is produced. Signal sequences encoding signal 
peptides also find application in simplifying protein purification techniques. In such applications, the 
extracellular secretion of the desired protein greatly facilitates purification by reducing the number of 
undesired proteins from which the desired protein must be selected. Thus, there exists a need to identify 
and characterize the 5' portions of the genes for secretory proteins which encode signal peptides.

Sequences coding for non-secreted proteins may also find application as therapeutics or 
diagnostics. In particular, such sequences may be used to determine whether an individual is likely to 
express a detectable phenotype, such as a disease, as a consequence of a mutation in the coding sequence 
of a protein. In instances where the individual is at risk of suffering from a disease or other undesirable 
phenotype as a result of a mutation in such a coding sequence, the undesirable phenotype may be
corrected by introducing a normal coding sequence using gene therapy. Alternatively, if the undesirable
phenotype results from overexpression of the protein encoded by the coding sequence, expression of the 
protein may be reduced using antisense or triple helix based strategies. 

The secreted or non-secreted human polypeptides encoded by the coding sequences may also be 
used as therapeutics by administering them directly to an individual having a condition, such as a
disease, resulting from a mutation in the sequence encoding the polypeptide. In such an instance, the 
condition can be cured or ameliorated by administering the polypeptide to the individual. 

In addition, the secreted or non-secreted human polypeptides or portions thereof may be used to 
generate antibodies useful in determining the tissue type or species of origin of a biological sample. The 
antibodies may also be used to determine the cellular localization of the secreted or non-secreted human
polypeptides or the cellular localization of polypeptides which have been fused to the human 
polypeptides. In addition, the antibodies may also be used in immunoaffinity chromatography 
techniques to isolate, purify, or enrich the human polypeptide or a target polypeptide which has been 
fused to the human polypeptide. 

Public information on the number of human genes for which the promoters and upstream
regulatory regions have been identified and characterized is quite limited. In part, this may be due to the
difficulty of isolating such regulatory sequences. Upstream regulatory sequences such as transcription
factor binding sites are typically too short to be utilized as probes for isolating promoters from human
genomic libraries. Recently, some approaches have been developed to isolate human promoters. One of
them consists of making a CpG island library (Cross et al., Nature Genetics 6:236-244, 1994). The
second consists of isolating human genomic DNA sequences containing Sp1 binding sites by the use of
Sp1 binding protein (Mortlock et al., Genome Res. 6:327-335, 1996). Both of these approaches are
limited by a lack of specificity and of comprehensiveness. Thus, there exists a need to identify
and systematically characterize the 5' portions of the genes.

The present 5' ESTs may be used to efficiently identify and isolate 5'UTRs and upstream
regulatory regions which control the location, developmental stage, rate, and quantity of protein
synthesis, as well as the stability of the mRNA. Once identified and characterized, these regulatory
regions may be utilized in gene therapy or protein purification schemes to obtain the desired amount and
locations of protein synthesis or to inhibit, reduce, or prevent the synthesis of undesirable gene products.

In addition, ESTs containing the 5' ends of protein genes may include sequences useful as
probes for chromosome mapping and the identification of individuals. Thus, there is a need to identify
and characterize the sequences upstream of the 5' coding sequences of genes.

Summary of the Invention 
The present invention relates to purified, isolated, or enriched 5' ESTs which include sequences
derived from the authentic 5' ends of their corresponding mRNAs. The term "corresponding mRNA"
refers to the mRNA which was the template for the cDNA synthesis which produced the 5' EST. These
sequences will be referred to hereinafter as "5' ESTs." The present invention also includes purified,
isolated or enriched nucleic acids comprising contigs assembled by determining consensus sequences
from a plurality of ESTs containing overlapping sequences. These contigs will be referred to herein as
"consensus contigated 5' ESTs."

As used herein, the term "purified" does not require absolute purity; rather, it is intended as a
relative definition. Individual 5' EST clones isolated from a cDNA library have been conventionally
purified to electrophoretic homogeneity. The sequences obtained from these clones could not be
obtained directly either from the library or from total human DNA. The cDNA clones are not naturally
occurring as such, but rather are obtained via manipulation of a partially purified naturally occurring
substance (messenger RNA). The conversion of mRNA into a cDNA library involves the creation of a
synthetic substance (cDNA) and pure individual cDNA clones can be isolated from the synthetic library
by clonal selection. Thus, creating a cDNA library from messenger RNA and subsequently isolating
individual clones from that library results in an approximately 10^4-10^6 fold purification of the native
message. Purification of starting material or natural material to at least one order of magnitude,
preferably two or three orders, and more preferably four or five orders of magnitude is expressly
contemplated.
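
The fold-purification figures above can be read as a simple ratio of abundances. The sketch below is only an arithmetic illustration under stated assumptions: the starting abundance of one message per 100,000 mRNA molecules is a hypothetical value chosen to fall within the range in the text, and the isolated clone is idealized as completely pure. The function names are likewise illustrative.

```python
import math

def fold_purification(fraction_before, fraction_after):
    """Ratio of a sequence's relative abundance after and before purification."""
    return fraction_after / fraction_before

def orders_of_magnitude(fold):
    """Express a fold purification as a base-10 order of magnitude."""
    return math.log10(fold)

if __name__ == "__main__":
    # Hypothetical: the message is 1 in 100,000 of the mRNA population,
    # and the isolated clone is treated as pure (fraction 1.0).
    fold = fold_purification(1e-5, 1.0)
    print(f"about {fold:.0f}-fold, i.e. {orders_of_magnitude(fold):.1f} orders of magnitude")
```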

As used herein, the term "isolated" requires that the material be removed from its original 
environment (e.g., the natural environment if it is naturally occurring). For example, a naturally- 
occurring polynucleotide present in a living animal is not isolated, but the same polynucleotide, 
separated from some or all of the coexisting materials in the natural system, is isolated.

As used herein, the term "recombinant" means that the 5' EST is adjacent to "backbone" nucleic 
acid to which it is not adjacent in its natural environment. Additionally, to be "enriched" the 5' ESTs will 

Brief Description of the Drawings
Figure 1 is a summary of a procedure for obtaining cDNAs which have been selected to include
the 5' ends of the mRNAs from which they are derived. In the first step (1), the cap of intact mRNAs is
oxidized so that it can be chemically ligated to an oligonucleotide tag. In the second step (2), a reverse
transcription is performed using random primers to generate a first cDNA strand. In the third step (3),
mRNAs are eliminated and the second strand synthesis is carried out using a primer contained in the
oligonucleotide tag.

Figure 2 is an analysis of the 43 amino terminal amino acids of all human SwissProt proteins to 
determine the frequency of false positives and false negatives using the techniques for signal peptide
identification described herein. 

Figure 3 summarizes a general method used to clone and sequence extended cDNAs containing
sequences adjacent to 5' ESTs.

Figure 4 provides a schematic description of the promoters isolated and the way they are
assembled with the corresponding 5' tags.

Figure 5 describes the transcription factor binding sites present in each of the promoters of
Figure 4.

Figure 6 is a block diagram of an exemplary computer system. 

Figure 7 is a flow diagram illustrating one embodiment of a process 200 for comparing a new 
nucleotide or protein sequence with a database of sequences in order to determine the homology levels
between the new sequence and the sequences in the database. 

Figure 8 is a flow diagram illustrating one embodiment of a process 250 in a computer for 
determining whether two sequences are homologous. 

Figure 9 is a flow diagram illustrating one embodiment of an identifier process 300 for 
detecting the presence of a feature in a sequence.

Figure 10 is a table with all of the parameters that can be used for each step of extended cDNA
analysis.

Detailed Description of the Preferred Embodiment 
I. Obtaining 5' ESTs from cDNA Libraries Including the 5' Ends of their Corresponding mRNAs

The 5' ESTs of the present invention were obtained from cDNA libraries including cDNAs
which include the 5' end of their corresponding mRNAs. The general method used to obtain such cDNA
libraries is described in Examples 1 to 5.

EXAMPLE 1 

Preparation of mRNA

Total human RNAs or polyA+ RNAs derived from 29 different tissues were respectively
purchased from LABIMO and CLONTECH and used to generate 44 cDNA libraries as described below.

1. A purified nucleic acid comprising a sequence selected from the group consisting of SEQ
ID NOs. 24-811 and SEQ ID NOs. 1600-1622 and sequences complementary to the sequences of
SEQ ID NOs. 24-811 and SEQ ID NOs. 1600-1622.

2. A purified nucleic acid comprising at least 15 consecutive nucleotides of a sequence 
selected from the group consisting of SEQ ID NOs. 24-811 and SEQ ID NOs. 1600-1622 and
sequences complementary to the sequences of SEQ ID NOs. 24-811 and SEQ ID NOs. 1600-1622.

3. A purified or isolated polypeptide comprising a sequence selected from the group 
consisting of the sequences of SEQ ID NOs. 812-1599. 

4. A method of making a cDNA comprising the steps of: 

a) contacting a collection of mRNA molecules from human cells with a primer
comprising at least 15 consecutive nucleotides of a sequence selected from the group consisting of
the sequences complementary to SEQ ID NOs. 24-811 and SEQ ID NOs. 1600-1622;

b) hybridizing said primer to an mRNA in said collection that encodes said protein;

c) reverse transcribing said hybridized primer to make a first cDNA strand from said
mRNA;

d) making a second cDNA strand complementary to said first cDNA strand; and

e) isolating the resulting cDNA comprising said first cDNA strand and said second
cDNA strand.

5. A method of making a cDNA comprising the steps of:

a) obtaining a cDNA comprising a sequence selected from the group consisting of SEQ
ID NOs. 24-811 and SEQ ID NOs. 1600-1622;

b) contacting said cDNA with a detectable probe comprising at least 15 consecutive
nucleotides of a sequence selected from the group consisting of SEQ ID NOs. 24-811 and SEQ ID
NOs. 1600-1622 and the sequences complementary to SEQ ID NOs. 24-811 and SEQ ID NOs.
1600-1622 under conditions which permit said probe to hybridize to said cDNA;

c) identifying a cDNA which hybridizes to said detectable probe; and 

d) isolating said cDNA which hybridizes to said probe. 

6. A method of making a cDNA comprising the steps of:

a) contacting a collection of mRNA molecules from human cells with a first primer
capable of hybridizing to the polyA tail of said mRNA;

b) hybridizing said first primer to said polyA tail;

c) reverse transcribing said mRNA to make a first cDNA strand;

d) making a second cDNA strand complementary to said first cDNA strand using at least
one primer comprising at least 15 consecutive nucleotides of a sequence selected from the group
consisting of SEQ ID NOs. 24-811 and SEQ ID NOs. 1600-1622; and

e) isolating the resulting cDNA comprising said first cDNA strand and said second
cDNA strand.

7. A method of making a polypeptide comprising the steps of: 

a) obtaining a cDNA which encodes a polypeptide encoded by a nucleic acid comprising
a sequence selected from the group consisting of SEQ ID NOs. 24-811 or a cDNA which encodes a
polypeptide comprising at least 10 consecutive amino acids of a polypeptide encoded by a sequence
selected from the group consisting of SEQ ID NOs. 24-811;

b) inserting said cDNA in an expression vector such that said cDNA is operably linked to
a promoter;

c) introducing said expression vector into a host cell whereby said host cell produces the
protein encoded by said cDNA; and

d) isolating said protein.

8. In an array of discrete ESTs or fragments thereof of at least 15 nucleotides in length, the
improvement comprising inclusion in said array of at least one sequence selected from the group
consisting of SEQ ID NOs. 24-811 and SEQ ID NOs. 1600-1622, the sequences complementary to
the sequences of SEQ ID NOs. 24-811 and SEQ ID NOs. 1600-1622 and fragments comprising at
least 15 consecutive nucleotides of said sequence.

9. The array of Claim 8 including therein at least five sequences selected from the group 
consisting of SEQ ID NOs. 24-811 and SEQ ID NOs. 1600-1622, the sequences complementary to
the sequences of SEQ ID NOs. 24-811 and SEQ ID NOs. 1600-1622 and fragments comprising at
least 15 consecutive nucleotides of said sequences. 

10. An enriched population of recombinant nucleic acids, said recombinant nucleic acids 
comprising an insert nucleic acid and a backbone nucleic acid, wherein at least 5% of said insert 
nucleic acids in said population comprise a sequence selected from the group consisting of SEQ ID 
NOs. 24-811 and SEQ ID NOs. 1600-1622, the sequences complementary to SEQ ID NOs. 24-811
and SEQ ID NOs. 1600-1622 and fragments comprising at least 15 consecutive nucleotides of said
sequences.

11. An antibody composition capable of selectively binding to an epitope-containing
fragment of a polypeptide comprising a contiguous span of at least 8 amino acids of any of SEQ ID 
NOs. 812-1599, wherein said antibody is polyclonal or monoclonal. 

12. A computer readable medium having stored thereon a sequence selected from the group
consisting of a nucleic acid code of SEQ ID NOs. 24-811 and 1600-1622 and a polypeptide code of
SEQ ID NOs. 812-1599. 

13. A computer system comprising a processor and a data storage device wherein said data
storage device has stored thereon a sequence selected from the group consisting of a nucleic acid
code of SEQ ID NOs. 24-811 and 1600-1622 and a polypeptide code of SEQ ID NOs. 812-1599.

14. The computer system of Claim 13 further comprising a sequence comparer and a data
storage device having reference sequences stored thereon.

15. The computer system of Claim 14 wherein said sequence comparer comprises a 
computer program which indicates polymorphisms. 

16. The computer system of Claim 13 further comprising an identifier which identifies
features in said sequence.

17. A method for comparing a first sequence to a reference sequence wherein said first 
sequence is selected from the group consisting of a nucleic acid code of SEQ ID NOs. 24-811 and
1600-1622 and a polypeptide code of SEQ ID NOs. 812-1599 comprising the steps of:

a) reading said first sequence and said reference sequence through use of a computer 
program which compares sequences; and 

b) determining differences between said first sequence and said reference sequence with
said computer program.

18. The method of Claim 17, wherein said step of determining differences between the first 
sequence and the reference sequence comprises identifying polymorphisms. 

19. A method for identifying a feature in a sequence selected from the group consisting of a
nucleic acid code of SEQ ID NOs. 24-811 and 1600-1622 and a polypeptide code of SEQ ID NOs.
812-1599 comprising the steps of:

a) reading said sequence through the use of a computer program which identifies features in
sequences; and

b) identifying features in said sequence with said computer program.

20. A vector comprising a nucleic acid according to either Claims 1 or 2. 



21. A host cell containing a nucleic acid of Claim 20.

1. Claims: Invention 1: 1-21 all partially

Nucleic acid comprising a sequence as in Seq.Id.No. 24,
complementary sequence and fragments thereof. Polypeptide,
Seq.Id.No. 812, encoded by said nucleotide sequence. Vector
comprising Seq.Id.No. 24 and host cell comprising the
vector. Methods of making cDNA and polypeptide utilising
Seq.Id.No. 24. Array of ESTs comprising Seq.Id.No. 24, or a
fragment thereof. An antibody binding to an epitope of the
polypeptide of Seq.Id.No. 812. A computer readable medium
and a computer system storing and/or utilising the sequence
of Seq.Id.No. 24 or 812.



2. Claims: Invention 2-811: 1-21 all partially

Idem as subject 1 but limited to each of the DNA sequences
as in Seq.Id.No. 25-811 and 1600-1622, and corresponding
polypeptides when applicable, where invention 2 is limited
to Seq.Id.No. 25 and 813, invention 3 is limited to
Seq.Id.No. 26 and 814, ..., invention 788 is limited
to Seq.Id.No. 811 and 1599, invention 789 is limited to
Seq.Id.No. 1600, invention 790 is limited to Seq.Id.No.
1601, ..., invention 811 is limited to Seq.Id.No.
1622.